How I lost 1000€ betting on CS:GO — Practice

Part 2 of a series of 2 posts where I explore how I lost 1000 euros betting on CS:GO with machine learning (ML). This post covers the actual implementation of the solution: CS:GO, feature engineering, modelling, validation, backtesting and lessons learned.

Author

Pedro Tabacof

Published

July 11, 2024

Check out the first post of the series here, which covers the theory and foundations necessary to understand what’s going on in this second post.

This is a true story of how I lost money using machine learning (ML) to bet on CS:GO. The original idea and implementation came from a friend, who gave me permission to share this story in public.

In this post, I will go over the actual implementation of the solution:

Solution

CS:GO

Counter-Strike: Global Offensive (CS:GO) is a first-person shooter (FPS) multiplayer game. It can be played casually or competitively. When played competitively, it typically follows this format:

  • Two teams of 5 play against each other: Terrorists vs Counter-Terrorists
  • Best of 3 maps (sometimes best of 1 or 5)
  • Maps are played up to 30 rounds
  • Each round can be won by killing the other team or by planting the bomb (Terrorists) or defusing it (Counter-Terrorists)
  • Each player has a number of kills (K), deaths (D), assists (A) and average damage per round (ADR)

If you don’t know much about video games, don’t worry: you can treat CS:GO like any other team sport.

CS:GO score screenshot from https://blog.scope.gg/cs-go-stats-why-is-it-so-important-en/

Web scraping

Data is the new oil.

As I explained in the first post, one of the reasons we chose CS:GO was data availability. Since we broke some terms and conditions, I won’t name our exact sources, but they were easily found online.

We collected both match data and betting odds. Note that match data is easy to find retroactively, but betting odds need to be collected in real-time, which limited our ability to run backtests (more on this later).

We collected three years’ worth of match data, covering over 30k matches. We managed to collect only 3 months of betting odds data, covering 1725 matches, with 30 odds per match.

Match data contained information such as the teams playing, team composition, kills and deaths for each player, rounds won, the map to be played, and the final score (win-loss-tie).

To scrape the data we needed, we used Selenium with a headless browser and then parsed the resulting HTML with BeautifulSoup.
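
Here is a minimal sketch of that setup, assuming headless Chrome; the URL and CSS selector are placeholders, not our actual sources:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Start a headless Chrome session
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Placeholder URL: we are deliberately not naming the real sources
driver.get("https://example.com/csgo/results")
html = driver.page_source
driver.quit()

# Parse the page and pull out one text blob per match row (hypothetical selector)
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("div.result-row")
matches = [row.get_text(separator=" ", strip=True) for row in rows]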

Feature engineering

Past behavior is the best predictor of future behavior.

With the match data, we created 100s of features. Most features were related to past performance, such as the percentage of times team 1 won on the map to be played or against team 2. That is, if the teams had faced off before, who won back then is an important predictor now. We also used game-score features like KD difference and ADR, on both a team and an individual basis.

Note that we couldn’t use the betting odds as features, even though the information there is invaluable (see this work which shows there is alpha by just averaging betting odds from different bookmakers). The reason is simple, as previously explained: we didn’t have a backfill for historical betting odds. We could only use the odds that were available after we started to collect them, which was only (barely) enough for backtesting.

While we built the features manually with some code automation, I’d suggest using Featuretools nowadays.

Also, we had a trump card, which ended up being the most important feature: TrueSkill.

TrueSkill

TrueSkill is a Bayesian skill rating system developed by Microsoft for multiplayer games. It aims to estimate the “true skill” of each player or team based on their performance history. The model uses a Gaussian distribution to represent the skill level of each player, and it updates these skill levels after each match using Bayesian updates. TrueSkill provides not just the ability but also the uncertainty around each player’s skill, both of which can be used as features in an ML model.
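
As a rough sketch of how such features can be computed, assuming the open-source trueskill Python package (the win-probability helper is a common recipe built on the model’s parameters, not part of the package itself):

import itertools
import math

import trueskill

# No-draw environment: ties are removed from the dataset anyway
env = trueskill.TrueSkill(draw_probability=0.0)

ratings = {}  # player id -> current Rating

def get_rating(player_id):
    return ratings.setdefault(player_id, env.create_rating())

def win_probability(team1, team2):
    # P(team1 beats team2) under the TrueSkill model (standard recipe)
    delta_mu = sum(r.mu for r in team1) - sum(r.mu for r in team2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(team1, team2))
    denom = math.sqrt(len(team1 + team2) * env.beta ** 2 + sum_sigma)
    return 0.5 * (1.0 + math.erf(delta_mu / denom / math.sqrt(2)))

# Example match: player ids are placeholders
team1 = [get_rating(p) for p in ["p1", "p2", "p3", "p4", "p5"]]
team2 = [get_rating(p) for p in ["p6", "p7", "p8", "p9", "p10"]]

p_team1 = win_probability(team1, team2)  # candidate feature for the ML model

# After the match, update the ratings (rank 0 = winner); the new Rating
# objects would then be written back into the `ratings` dict
new_team1, new_team2 = env.rate([team1, team2], ranks=[0, 1])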

For another post focused on Bayesian models, check out How (not) to forecast an election.

Inferential vs predictive models

There are two cultures in the use of statistical modeling to reach conclusions from data. -Leo Breiman1

If we have TrueSkill, which predicts the win probability between two teams, why do we even need an ML model? TrueSkill is an inferential model, which attempts to explain the world through latent variables. Of course, a perfect model of the world would also make great predictions, but in practice there is always a trade-off between explainability and predictive power. That is the biggest tension between statistics and ML.

ML models are typically less interpretable black boxes but much more powerful at making predictions. They can incorporate a wide range of features, including but not limited to those provided by TrueSkill, with the sole focus of optimising a loss function.

Dataset

Here is the matches dataset with all the features, including TrueSkill, and target together. I don’t show the actual feature engineering calculation for the sake of brevity, as this post is long enough as it is.

import pandas as pd

df = pd.read_parquet("dataset.parquet")

Modelling

XGBoost is all you need. -Bojan Tunguz

The modelling done here is pretty standard with a couple of notable exceptions:

  1. We remove ties, which represent roughly 1.5% of the dataset
  2. We do data augmentation by swapping team1 and team2 features and adding both rows to the training set
    • We can do that as there is no “home advantage” in CS:GO like there is in football
    • When making predictions, we average the predictions across both scenarios

Out-of-time train-test split

We use an out-of-time split instead of the more typical cross-validation. In pretty much any real-life application, a model is trained on past data and ends up being used to predict future, unseen data. Your evaluation should reflect that, as you might be interested in knowing how your model’s performance degrades over time (which could be caused, for example, by concept drift).

If you intend to re-train your model periodically, say weekly, you could evaluate that strategy with time-series cross-validation, where you train the model with data up to week X and predict on data from week X+1.
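
As a minimal sketch of that idea, here is scikit-learn’s TimeSeriesSplit on this problem. It is an approximation (it splits by row count rather than by calendar week), and X and y are assumed to be the chronologically sorted feature matrix and target built later in this post:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# X, y: feature matrix and target, sorted by match_date (assumed to exist)
tscv = TimeSeriesSplit(n_splits=5)
aucs = []
for train_idx, test_idx in tscv.split(X):
    clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = clf.predict_proba(X.iloc[test_idx])[:, 1]
    aucs.append(roc_auc_score(y.iloc[test_idx], preds))

print(f"AUC per fold: {np.round(aucs, 3)}, mean: {np.mean(aucs):.3f}")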

import numpy as np

dataset = pd.read_parquet("dataset.parquet").drop_duplicates()

dt_train = '2019-01-01'
dt_test = '2019-08-01'

dataset['target'] = (dataset['winner'] == 'team1').astype(bool)
dataset = dataset[
    (dataset['match_date'] >= '2017-01-01') &
    (dataset['winner'] != 'tie') &
    (dataset['match_id'] != 'https://www.hltv.org/matches/2332976/lucid-dream-vs-alpha-red-esl-pro-league-season-9-asia')
].reset_index(drop=True)

mask_train = dataset['match_date'] < dt_train
dataset_train = dataset.loc[mask_train].reset_index(drop=True)

dataset_train2 = dataset_train.sample(frac=1).reset_index(drop=True)
dataset_train2['target'] = ~dataset_train2['target']
cols = []

# Swapping team features
for c in list(dataset_train.columns):
    if c.startswith('team1_'):
        cols.append(c.replace('team1_', 'team2_').replace('_team2', '_team1'))
    elif c.startswith('team2_'):
        cols.append(c.replace('team2_', 'team1_').replace('_team1', '_team2'))
    else:
        cols.append(c)

dataset_train2 = dataset_train2.rename(columns=dict(zip(dataset_train.columns, cols)))
dataset_train = dataset_train[cols]
dataset_train2 = dataset_train2[cols]
dataset_train = pd.concat([dataset_train, dataset_train2], axis=0, ignore_index=True).reset_index(drop=True)

idxs = np.random.choice(len(dataset_train), replace=False, size=4000)
dataset_val = dataset_train.loc[idxs].drop_duplicates('match_id').reset_index(drop=True)

index = np.arange(len(dataset_train))
mask = ~np.in1d(index, idxs)
dataset_train = dataset_train.loc[mask].reset_index(drop=True)

mask_test = (
    (dataset['match_date'] >= dt_train) &
    (dataset['match_date'] < dt_test)
)
dataset_test = dataset.loc[mask_test].reset_index(drop=True)
dataset_train.shape, dataset_val.shape, dataset_test.shape
((38490, 267), (3802, 267), (4759, 267))
Code
import plotly.express as px

dataset['match_date'] = pd.to_datetime(dataset['match_date'])

# Create weekly match counts
match_counts = dataset.groupby(dataset['match_date'].dt.to_period('W')).size().reset_index(name='count')
match_counts['match_date'] = match_counts['match_date'].dt.to_timestamp()

# Define color for each period
match_counts['period'] = 'Train'
match_counts.loc[match_counts['match_date'] >= pd.to_datetime(dt_train), 'period'] = 'Test'

# For validation, we'll consider it as part of the train set but with a different color
val_mask = dataset_val['match_date'].dt.to_period('W').value_counts().reset_index()
val_mask.columns = ['match_date', 'val_count']
match_counts = match_counts.merge(val_mask, on='match_date', how='left')
match_counts['val_count'] = match_counts['val_count'].fillna(0)
match_counts.loc[match_counts['val_count'] > 0, 'period'] = 'Validation'

# Create the plot
fig = px.line(match_counts, x='match_date', y='count', color='period', 
              title='Number of Matches Over Time (Weekly)',
              labels={'count': 'Number of Matches', 'match_date': 'Date'},
              color_discrete_map={'Train': 'blue', 'Validation': 'green', 'Test': 'red'})

# Add vertical lines for train/test split
fig.add_vline(x=dt_train, line_dash="dash", line_color="gray")

# Add annotation for the train/test split
fig.add_annotation(x=dt_train, y=1, yref="paper", showarrow=False,
                   text="Train/Test Split", textangle=-90, xanchor="right")

# Update layout for better readability
fig.update_layout(
    legend_title_text='Dataset',
    xaxis_title="Date",
    yaxis_title="Number of Matches per Week",
)

fig.show()

Model: LightGBM

We use a standard off-the-shelf LightGBM binary classifier. There are many advantages to using LightGBM or XGBoost for tabular data problems (either choice is fine!):

  • Handles missing values natively
  • Handles categorical features natively
  • Early stopping to optimize the number of estimators
  • Blazing fast and scalable
  • Multiple loss function options, including custom ones
    • For binary classification, the default is the negative logloss (a proper scoring rule, which should lead to well-calibrated probabilities)

For more information on how to unlock the power of LightGBM, watch my PyData London 2022 presentation.

from lightgbm import LGBMClassifier
from lightgbm.callback import early_stopping, log_evaluation
class CSGOPredictor(object):
    def __init__(self, model_params):
        self.model_params = model_params

    def fit(self, x_train, y_train, x_val, y_val):
        self.lgb = LGBMClassifier(**self.model_params)
        self.lgb.fit(
            x_train, y_train,
            eval_set=[(x_train, y_train), (x_val, y_val)],
            eval_names=['training', 'validation'],
            callbacks=[
                early_stopping(stopping_rounds=50),
                log_evaluation(period=25),  # Log every 25 iterations
            ]
        )
        return self

    def predict_proba(self, x):
        # Predictions are done twice and then averaged, with swapped team features
        original = self.lgb.predict_proba(x)

        x_inv = x.copy()
        team1_cols = [i for i in x_inv.columns if i.startswith('team1')]
        team2_cols = [i for i in x_inv.columns if i.startswith('team2')]

        x_inv = x_inv.rename(dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)), axis=1)
        x_inv = x_inv.reindex(columns=x.columns)

        inv = self.lgb.predict_proba(x_inv)

        # Swap the probability columns so inv[:, 1] again refers to team1 winning
        inv[:, 0], inv[:, 1] = inv[:, 1], inv[:, 0].copy()

        return (original+inv)/2.0
    
    def predict(self, x):
        return self.predict_proba(x).argmax(axis=1)
Code
drop_cols = ['winner',  'match_date', 'match_id', 'event_id', 'team1_id', 'team2_id', 'target']

x_train = dataset_train.drop(columns=drop_cols, axis=1)
y_train = dataset_train['target']
features = list(x_train.columns)

x_val = dataset_val[features]
y_val = dataset_val['target']

x_test = dataset_test[features]
y_test = dataset_test['target']

model_params = {
    'n_estimators': 10_000,
    'learning_rate': 0.05
}

model = CSGOPredictor(model_params).fit(x_train, y_train, x_val, y_val)
Training until validation scores don't improve for 50 rounds
[25]    training's binary_logloss: 0.589321 validation's binary_logloss: 0.604284
[50]    training's binary_logloss: 0.562941 validation's binary_logloss: 0.588552
[75]    training's binary_logloss: 0.547649 validation's binary_logloss: 0.584521
[100]   training's binary_logloss: 0.535844 validation's binary_logloss: 0.583577
[125]   training's binary_logloss: 0.525761 validation's binary_logloss: 0.5834
[150]   training's binary_logloss: 0.515957 validation's binary_logloss: 0.582886
[175]   training's binary_logloss: 0.506858 validation's binary_logloss: 0.582916
[200]   training's binary_logloss: 0.49788  validation's binary_logloss: 0.58255
[225]   training's binary_logloss: 0.488956 validation's binary_logloss: 0.581818
[250]   training's binary_logloss: 0.48075  validation's binary_logloss: 0.582284
[275]   training's binary_logloss: 0.472855 validation's binary_logloss: 0.581823
[300]   training's binary_logloss: 0.46497  validation's binary_logloss: 0.582013
Early stopping, best iteration is:
[271]   training's binary_logloss: 0.474098 validation's binary_logloss: 0.581628

Feature importance

Here is the “beeswarm” view of SHAP values. It shows not just the importance but also how each feature relates to the prediction logits:

import shap

explainer = shap.Explainer(model.lgb)
shap_values = explainer(x_test)
shap.plots.beeswarm(shap_values, max_display=20)

Unsurprisingly, the TrueSkill win probability features are the most important ones. In a sense, this can be seen as a form of stacking, since TrueSkill is another model. Other important features relate to the team’s past performance, like KD ratio and ADR.

Are ~250 features really necessary? Probably not, especially with just 30k samples2. We didn’t do any feature selection, but with more time on my hands I’d run permutation importance and adversarial validation on a time split3.
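
For reference, here is a minimal sketch of permutation importance on the out-of-time test set. It runs directly against the underlying LGBMClassifier (model.lgb), so it skips the swapped-team averaging:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in test AUC
result = permutation_importance(
    model.lgb, x_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
perm_importance = (
    pd.Series(result.importances_mean, index=x_test.columns)
    .sort_values(ascending=False)
)
print(perm_importance.head(20))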

Evaluation

We evaluate using the following metrics:

  • Accuracy: how many bets you expect to get right
  • AUC4: how well you rank-order the winners/losers
  • Brier score: a metric that takes both calibration and accuracy into account

I also plot the calibration curves for the training and test sets.

Code
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

def calculate_metrics(X, y, model):
    y_pred_proba = model.predict_proba(X)[:, 1]
    y_pred = model.predict(X)
    return {
        'Accuracy': accuracy_score(y, y_pred),
        'AUC': roc_auc_score(y, y_pred_proba),
        'Brier_score': brier_score_loss(y, y_pred_proba)
    }

metrics_train = calculate_metrics(x_train, y_train, model)
metrics_val = calculate_metrics(x_val, y_val, model)
metrics_test = calculate_metrics(x_test, y_test, model)

metrics_df = pd.DataFrame([metrics_train, metrics_val, metrics_test],
                          index=['Training', 'Validation', 'Test'])
metrics_df
Accuracy AUC Brier_score
Training 0.776955 0.867345 0.155439
Validation 0.734087 0.816379 0.176430
Test 0.704980 0.771550 0.191545
Code
import plotly.graph_objects as go
from sklearn.calibration import calibration_curve

def plot_calibration_curve(y_true, y_pred_proba, set_name, fig, color):
    # calibration_curve returns (fraction_of_positives, mean_predicted_value)
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_pred_proba, n_bins=10)
    fig.add_trace(go.Scatter(
        x=mean_predicted_value, y=fraction_of_positives,
        mode='lines+markers', name=f'{set_name} set',
        line=dict(color=color)
    ))

# Create a new figure for the calibration plot
calibration_fig = go.Figure()

# Add the perfectly calibrated line
calibration_fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines', name='Perfectly calibrated',
    line=dict(dash='dot')
))

# Plot calibration curve for the training set
plot_calibration_curve(y_train, model.predict_proba(x_train)[:, 1], 'Training', calibration_fig, 'blue')

# Plot calibration curve for the test set
plot_calibration_curve(y_test, model.predict_proba(x_test)[:, 1], 'Test', calibration_fig, 'red')

# Set layout properties for the calibration plot
calibration_fig.update_layout(
    title="Calibration plot",
    xaxis_title="Mean predicted value",
    yaxis_title="Fraction of positives",
    xaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    yaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    showlegend=True
)

calibration_fig.show()

The model seems well calibrated, which makes it useful for betting: recall from the previous post that our betting decision rule is based on the probability of team 1 or 2 winning.

If the model wasn’t well calibrated, we could use isotonic regression on a validation set to fix the calibration issues. There are other options for post-hoc model calibration, like Platt scaling, but isotonic regression works best for tree-based models.
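
As a minimal sketch of that post-hoc step, assuming the fitted model and the x_val/y_val and x_test frames defined above:

from sklearn.isotonic import IsotonicRegression

# Learn a monotonic mapping from raw model probabilities to calibrated ones
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(model.predict_proba(x_val)[:, 1], y_val)

# Calibrated probability of team1 winning for unseen matches
calibrated_test_proba = iso.predict(model.predict_proba(x_test)[:, 1])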

Code
def auc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store AUC for each week
    weekly_auc = {}

    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            weekly_auc[week_start_date] = auc

    return pd.Series(weekly_auc)

def acc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store accuracy for each week
    weekly_acc = {}

    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            acc = accuracy_score(y, model.predict(X))
            weekly_acc[week_start_date] = acc

    return pd.Series(weekly_acc)
Code
# Calculate weekly AUC for training and test sets
weekly_auc_train = auc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_auc_test = auc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the AUC over time using Plotly
trace0 = go.Scatter(
    x=weekly_auc_train.index,
    y=weekly_auc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)

trace1 = go.Scatter(
    x=weekly_auc_test.index,
    y=weekly_auc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)

layout = go.Layout(
    title='AUC Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='AUC'),
    showlegend=True
)

fig = go.Figure(data=[trace0, trace1], layout=layout)

fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_auc = weekly_auc_train.mean()
avg_test_auc = weekly_auc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_auc_train.index.min(), y0=avg_train_auc,
              x1=weekly_auc_train.index.max(), y1=avg_train_auc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_auc_test.index.min(), y0=avg_test_auc,
              x1=weekly_auc_test.index.max(), y1=avg_test_auc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_auc_train.index.max(), y=avg_train_auc,
                   text=f"Train Avg: {avg_train_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_auc_test.index.max(), y=avg_test_auc,
                   text=f"Test Avg: {avg_test_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")


fig.show()

There is a train-test performance gap, which implies overfitting, but that’s not a big concern per se. What we really care about is whether the out-of-time performance is good enough, which is what the backtest below evaluates. Overfitting is normal with gradient-boosted tree models, and their generalization performance is still better than that of other models like logistic regression or random forests (I will leave this as an exercise to the reader).

Also, note the big drop in the last 3 weeks of the test dataset, which hints at some kind of drift. That suggests we should not let the model go for more than 6 months without re-training.

Code
# Calculate weekly accuracy for training and test sets
weekly_acc_train = acc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_acc_test = acc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the accuracy over time using Plotly
trace0 = go.Scatter(
    x=weekly_acc_train.index,
    y=weekly_acc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)

trace1 = go.Scatter(
    x=weekly_acc_test.index,
    y=weekly_acc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)

layout = go.Layout(
    title='Accuracy Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='Accuracy'),
    showlegend=True
)

fig = go.Figure(data=[trace0, trace1], layout=layout)

fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_acc = weekly_acc_train.mean()
avg_test_acc = weekly_acc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_acc_train.index.min(), y0=avg_train_acc,
              x1=weekly_acc_train.index.max(), y1=avg_train_acc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_acc_test.index.min(), y0=avg_test_acc,
              x1=weekly_acc_test.index.max(), y1=avg_test_acc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_acc_train.index.max(), y=avg_train_acc,
                   text=f"Train Avg: {avg_train_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_acc_test.index.max(), y=avg_test_acc,
                   text=f"Test Avg: {avg_test_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")


fig.show()

The accuracy plot is similar to the AUC plot in all respects. Note that we’re much better than predicting at random, but random is not a good baseline here. A much better baseline is the accuracy obtained from the probabilities implied by the betting odds.

Backtesting

Past performance is no guarantee of future results.

Backtesting means replaying the past with your model’s decisions. One example of a thorough backtest is the following (a bare-bones skeleton follows the list):

  1. Train model with data up to a certain date
  2. Sample betting odds for the next matches
  3. Make bets for those next matches according to your betting strategy
  4. Repeat 1-3 until you cover all the test data
  5. Evaluate ML metrics (e.g. AUC) and business metrics (e.g. ROI) on your bets
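
Here is that bare-bones skeleton of steps 1-4, assuming the dataset and dataset_with_odds frames used elsewhere in this post and the CSGOPredictor defined above; make_bets is a hypothetical helper that applies the betting rules shown further down:

import pandas as pd

bet_log = []
for retrain_date in pd.date_range("2019-06-01", "2019-08-01", freq="W-MON"):
    # 1. Train only on data available before the retrain date
    past = dataset[dataset["match_date"] < retrain_date]
    model_t = CSGOPredictor(model_params).fit(
        past[features], past["target"], x_val, y_val  # validation set kept fixed here
    )

    # 2-3. Bet on the matches (and odds) of the following week
    upcoming = dataset_with_odds[
        (dataset_with_odds["match_date"] >= retrain_date)
        & (dataset_with_odds["match_date"] < retrain_date + pd.Timedelta(weeks=1))
    ]
    bet_log.append(make_bets(model_t, upcoming))  # hypothetical helper

# 5. Evaluate ML and business metrics over the concatenated bet log
bets = pd.concat(bet_log, ignore_index=True)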

Backtesting allows us to assess our financial performance, which matters a lot more than ML metrics. For example, is an AUC of 0.77 good or bad? That is hard to tell in general, while an ROI of 1.1 is something we can understand and compare to other strategies (including leaving your money in the bank to earn risk-free interest).

Here, we only assess the ROI of the bets, not other financial metrics like the Sharpe ratio or max drawdown.

For simplicity, we just train the model once and keep it fixed for all future bets, which makes it a more conservative backtest.

First, let’s load the dataset of matches with betting odds:

dataset_with_odds = pd.read_parquet("match_predictions_with_odds.parquet")
dataset_with_odds = dataset_with_odds[["match_id", "team1_odds", "team2_odds"]]
dataset_with_odds = dataset_with_odds.merge(dataset_test, on="match_id")
dataset_with_odds['match_date'] = pd.to_datetime(dataset_with_odds['match_date'])
dataset_with_odds = dataset_with_odds.sort_values(by='match_date')

dataset_with_odds.shape
(1113837, 269)

Now, let’s simulate our betting strategy:

  • For each match, sample just one odds quote at random
  • Only bet if the winning probability is over 50%, AND
  • Only bet if the probability of winning is greater than the probability implied by the odds plus a delta of 1%
  • The bet can either be a fixed amount or determined by the Kelly criterion (here, for simplicity, I only show fixed betting; see the sketch right after this list)
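
For completeness, here is a minimal sketch of the (half-)Kelly stake as a fraction of the bankroll, using the textbook formula with decimal odds and the model’s win probability (the numbers are illustrative, not our actual bets):

def kelly_fraction(p_win: float, decimal_odds: float, multiplier: float = 0.5) -> float:
    # Textbook Kelly: f* = (b*p - q) / b, with b the net odds and q = 1 - p
    b = decimal_odds - 1.0
    f = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, multiplier * f)  # half-Kelly by default; never stake a negative amount

bankroll = 1000.0  # illustrative bankroll in euros
stake = bankroll * kelly_fraction(p_win=0.62, decimal_odds=1.80)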

The first premise sounds odd: shouldn’t we pick the best possible odds? No, for two reasons that apply to real life: (1) for risk management, you don’t want to bet multiple times on the same match; (2) you might not be able to bet when you want, for multiple reasons (e.g. you are asleep).

The next two points are our betting strategy. There was some trial and error involved in designing it, and I’m sure there is room for improvement.

The delta of 1% is our safety margin due to model error. We found this value with a grid search. It’s a parameter you can play with in the simulation below:

MIN_PROBA = 0.5
MIN_DELTA_PROBA = 0.01
N_SIMS = 200

all_samples_data = [] 

for _ in range(N_SIMS):
    df = dataset_with_odds.groupby('match_id').apply(lambda x: x.sample(1)).reset_index(drop=True)
    predict_proba = model.predict_proba(df[features])
    df['team1_proba'] = predict_proba[:, 1]
    df['team2_proba'] = predict_proba[:, 0]
    df["team1_implied_prob"] = 1 / df["team1_odds"]
    df["team2_implied_prob"] = 1 / df["team2_odds"]
    df["team1_bet"] = (df.team1_proba > MIN_PROBA) & (df.team1_proba > (df.team1_implied_prob + MIN_DELTA_PROBA))
    df["team2_bet"] = (df.team2_proba > MIN_PROBA) & (df.team2_proba > (df.team2_implied_prob + MIN_DELTA_PROBA))
    df["team1_returns"] = np.where(df.team1_bet & (df.winner=='team1'), df["team1_odds"], 0.0)
    df["team2_returns"] = np.where(df.team2_bet & (df.winner=='team2'), df["team2_odds"], 0.0)
    df["loss"] = df["team1_bet"].astype(int) + df["team2_bet"].astype(int)
    df["revenue"] = df["team1_returns"] + df["team2_returns"]
    df["profit"] = df["revenue"] - df["loss"]
    all_samples_data.append(df)
Code
all_samples_df = pd.concat(all_samples_data).reset_index(drop=True)
all_samples_df['match_date'] = pd.to_datetime(all_samples_df['match_date'])
all_samples_df.sort_values(by='match_date', inplace=True)
all_samples_df['cumulative_profit'] = all_samples_df.groupby('match_date')['profit'].cumsum()

daily_profit_sum = all_samples_df.groupby('match_date')['profit'].sum().reset_index()
daily_profit_sum['cumulative_profit'] = daily_profit_sum['profit'].cumsum()/N_SIMS

total_profits = all_samples_df['profit'].sum()
total_bets = all_samples_df['loss'].sum()  # This assumes that 'loss' is the number of bets in the all_samples_df
roi = total_profits / total_bets if total_bets > 0 else 0

# Calculate the annualized ROI
min_date = all_samples_df['match_date'].min()
max_date = all_samples_df['match_date'].max()
duration_years = (max_date - min_date) / pd.Timedelta(days=365.25)
annualized_roi = (roi + 1) ** (1 / duration_years) - 1 if duration_years > 0 else 0
print(f"Backtest ROI: {round(roi*100)}%")
print(f"Annualized ROI: {round(annualized_roi*100)}%")
Backtest ROI: 10%
Annualized ROI: 63%

The ROI after 2 months is 10%, which annualized would be 63%: not bad at all! For reference, the risk-free interest rate in the US today is around 5% per year, while the average S&P 500 return is roughly 10% a year.

We did have an edge after all. Let’s see the uncertainty across multiple simulations:

Code
# Create a Plotly figure
fig = go.Figure()

# Add traces for each sample's cumulative profits
for sample_data in all_samples_data:
    # Make sure to sort the sample_data by 'match_date'
    sample_data_sorted = sample_data.sort_values(by='match_date')
    fig.add_trace(go.Scatter(
        x=sample_data_sorted['match_date'],
        y=sample_data_sorted['profit'].cumsum(),
        mode='lines',
        line=dict(width=1, color='lightgrey'),
        showlegend=False
    ))

# Add a trace for the average cumulative profits per date
fig.add_trace(go.Scatter(
    x=daily_profit_sum['match_date'],
    y=daily_profit_sum['cumulative_profit'],
    mode='lines',
    name='Avg Cum. Profits',
    line=dict(width=3, color='blue')
))

# Adding ROI text
fig.add_trace(go.Scatter(
    x=[daily_profit_sum['match_date'].iloc[-1] + pd.DateOffset(days=4)],
    y=[daily_profit_sum['cumulative_profit'].iloc[-1]],
    text=[f"ROI: {roi:.2f}"],  # The ROI text
    mode="text",
    showlegend=False,
    textfont=dict(  # Adjust the font properties here
        size=14,
        color='black',
    )
))

# Update layout to add titles and make it more informative
fig.update_layout(
    title="Cumulative Profits over Time with Average",
    xaxis_title="Match Date",
    yaxis_title="Cumulative Profit",
    legend_title="Legend",
    template="plotly_white",
    xaxis=dict(
        type='date'  # Ensure that x-axis is treated as date
    )
)

# Show the figure
fig.show()

There is some variability, but all lines are still ROI-positive. Also, note that we’re conservative in the backtest, as the model is static and we pick the betting odds at random.

However, as you saw in my first post, my ROI was actually negative and I did lose 1000 euros. What gives?

Betting strategy

Our actual betting strategy was just like the simulation above, with a few additions:

  • I did use the half-Kelly criterion for the betting amounts
  • The model was retrained weekly
  • A Slack bot would alert us to the bets we were supposed to make

Then, each bet was made manually. We considered automating that process, but that risked breaking the terms and conditions of the betting websites, which could lead to a ban or a funds freeze, either of which would have been a catastrophic loss.

Having to make the bets manually was one of the worst parts of this experience5. Finding time for it was tricky, and multiple good bets were lost due to timing. This is probably one of the biggest reasons why the backtesting was optimistic: There is no edge if you cannot act on it.

Why did I lose money after all?

While we don’t know the precise root cause of the massive bankroll drop we faced in July 2019 (trust me, we tried!), in retrospect, I realise there were many systematic failures which led to those results:

  • Little risk management: besides the backtesting curves, we didn’t check other risk metrics like risk-adjusted returns (e.g. the Sharpe ratio) or the maximum drawdown of the strategy. More importantly, I didn’t consider the “unknown unknowns”.
  • No entry strategy: even Ed Thorp, with his card-counting skills, started small and paper-traded for a while before going all in. I came in too strong, without any idea whether the system worked in practice or not (spoiler alert: it didn’t).
  • No exit strategy: I decided to quit based on “vibes”, but quitting should also be planned, based on achieved financial goals, stop losses or when your edge becomes dull.
  • Kelly criterion: I used the half-Kelly criterion (see previous blog post) instead of fixed bets, which made me overly aggressive in the beginning even though I had no evidence of having an actual edge.
  • “ML is all you need” fallacy: ML is always just one part of a broader decision-making system. We focused too much on modelling (250 features!) and on engineering (automated feature scraping, Slack alerts) and not enough on financial decision-making.

Overall, emotions clouded my judgement: I was overconfident and focused too much on ML and too little on finance.

If I had started small, paper-trading or betting small fixed amounts, and validated the system end-to-end before increasing the bets, I probably wouldn’t have lost so much money so fast.

My friend, who persisted for a couple of years longer, did end up with a positive ROI, even with those losses, but one much lower than what our backtesting implied. This means the backtesting was actually optimistic rather than conservative, and it’s not obvious why. I will leave that question as a take-home exercise.

Conclusion

Don’t believe anyone who sells you how to trade or bet (with ML or otherwise). Your edges need to be hard won and only known to you. Be rigorous about your edge: backtesting is not enough, consider the possible unknown unknowns and have a thorough financial strategy including entry and exit plans. Paper trade, simulate different scenarios, start small, and prove your edge before going all in.

In the end, here is what I really took away from it: making money directly with ML is hard, just find an ML job instead!

Footnotes

  1. For more on this distinction, I recommend reading Statistical Modeling: The Two Cultures by Leo Breiman.↩︎

  2. There is one heterodox view which suggests that yes, you should use all the features you can if you don’t care about interpretability, only about predictive power. Doing feature selection makes you more exposed to the risk of a few features degrading over time. Any ensemble model that is trained with feature sampling (a standard recommendation) will distribute importance across correlated features, making you less susceptible to one of them failing. If they all fail together, it’s no different than depending on just one of them. Caveat emptor: only attempt this if you know what you’re doing (also, your data scientist colleagues will not be happy!).↩︎

  3. I strongly recommend not using correlation between features as a feature selection method. Features can be highly correlated and still provide class separation in some nonlinear settings. No ensemble model will break with perfectly correlated features, unlike linear regression without regularization.↩︎

  4. While the AUC seems like an odd choice given our concern about calibration, which is something AUC doesn’t measure at all, if you can rank-order examples well you can always calibrate your model later with isotonic regression.↩︎

  5. Losing money, of course, was worse. However, even worse was getting addicted to watching the CS:GO matches that I bet on. That was a waste of time, anxiety-inducing, and made me feel like a gambler, not a data scientist.↩︎